1. OBJECTIVES


The purpose of this research project is to refine the current 
notion of system reliability by identifying and investigating attributes 
of a system which are important to reliability considerations and to 
develop techniques which facilitate analysis of system reliability. 
Attributes selected for investigation included: 

(a) Fault tolerance - the ability to maintain error-free
input-output behavior in the presence of (temporary and/or
permanent) faults in the system

(b) Diagnosability - the ability to detect and locate faults in the 
system 

(c) Reconfigurability - the ability to reconfigure the system after 
the occurrence of a fault so as to realize the original behavior 
or some other (possibly less complex) behavior 

with the following proposed objectives:

I. To determine, relative to the above attributes, properties 
of system structure that are conducive to a particular attribute. 
Structures so considered will range from state-transition functions at
one extreme to hardware and software realizations at the other extreme. 

II. To determine methods for obtaining reliable realizations of
a given system behavior. In particular, one would like to obtain
realizations which are fault tolerant (relative to the specified behavior)
and yet diagnosable (relative to some extended behavior).

III. To determine how properties of system behavior relate to 
the complexity of fault tolerant (diagnosable, reconfigurable) realiza- 
tions. Once such relationships are discovered, the inherent fault 
tolerance (diagnosability, reconfigurability) of a given behavior could 
be measured by the minimum complexity of realizations possessing 
that reliability attribute. 

IV. To determine methods for evaluating the reliability of a 
proposed or existing system as measured in terms of fault tolerance, 
diagnosability, reconfigurability, or combinations of these attributes.
This includes the investigation of appropriate reliability measures, 
modeling techniques, and computational methods for determining, or 
at least estimating, system reliability. 

Since the initiation of the grant, the above proposed objectives 
have been augmented to obtain a more definitive statement of what 
research should be accomplished to meet the needs of NASA and, 
in particular, the Langley Research Center. The following statements 
of these augmented objectives are due primarily to the constructive 
suggestions of NASA-Langley, with some subsequent modification in
wording to conform more closely with our interpretation:

I. To develop formal concepts and establish mathematical results 
which can be used to precisely define measures of system utility, e.g.:

1. Measures of fault tolerance; 

2. Measures of recoverability based on measures of detectability,
locatability and reconfigurability;

3. Measures of system availability with respect to different 
levels of system performance; 

4. Measures of total system "worth" based on measures of
performance worth and measures of performance availability. 

II. To develop analytic and simulation methods for evaluating 
system utility measures. 

III. To determine architectural characteristics of fault-tolerant
systems that are amenable to fault detection and fault location. 

IV. To investigate methods of on-line diagnosis that are
applicable to specific subsystems of a fault-tolerant computing system, e.g.,

given an arithmetic unit subject to a specified class of faults,
design a detector that, with a specified allowable time delay,
will detect any error produced by a fault.

V. To investigate methods of augmenting the structure of 
specific hardware or software subsystems in order to facilitate detector 
design and improve on-line diagnosability.

2. PERSONNEL 

To meet the objectives stated in Section 1, it was estimated that 

the following technical effort would be required: 

Principal Investigator 

25 percent time, academic year 

100 percent time, two months, summer 
Research Assistants

1 at 50 percent time, academic year 

2 at 25 percent time, academic year 

3 at 100 percent time, summer 

Programmer 

25 percent time, fiscal year 

During the period 1 January - 30 June 1974 (referred to as the "reporting
period") research personnel and their level of effort have been:

Principal Investigator 
John F. Meyer 

25 percent time, January - May 
100 percent time, June 

Research Assistants 

David E. Frisque 

20 percent time, January - April 

100 percent time, May - June 

Carolyn P. Steinhaus

13 percent time, February - April 

100 percent time, May - June 

Robert J. Sundstrom 

54 percent time, January - April 

100 percent time, May - June 

3. TECHNICAL STATUS 

In proposing the research activity to be conducted under the 
subject grant, several specific investigations were proposed for con- 
sideration during the year. Of the proposed investigations, the two 
focused on during the reporting period were: 

(1) Reliability Analysis - Determine appropriate measures of 
system reliability that can be evaluated relative to some specified 
level of structural description, with initial emphasis on the architectural 
level; develop models for reliability analysis, with respect to the above 
measures; and develop simulation models and computational methods
for evaluating these measures. What we eventually seek are programmed 
reliability evaluation procedures that can be used interactively, during 
the process of system design, to compare the reliability of various 
design alternatives. Such procedures could also be used to compare 
the reliability of various existing systems. The evaluation procedures 
should be general enough to accommodate new schemes for reliability 
enhancement, in addition to well-known techniques. This is to be 
contrasted with CARE [1], for example, which was designed specifically
for the evaluation of modular redundancy and standby-sparing schemes.

(2) On-line Fault Diagnosis - Determine structural and
behavioral properties of systems that are conducive to their "on-line"
diagnosis; investigate techniques (other than duplication) for
implementing on-line diagnosis; and determine methods for altering the
design of a system to improve its on-line diagnosability. As contrasted
with "off-line" diagnosis, an on-line diagnostic procedure must contend
with (i) system input over which it has no control and (ii) faults that
occur as the system is being diagnosed. To account for these
complicating factors, the study will be based on a representation of faulty
digital systems as "resettable discrete-time systems," first
introduced in [2].
3.1 Reliability Analysis

3.1.1 Background. A review of the current state of the art of
reliability analysis reveals a situation common to relatively new fields, 
namely, the tendency to hold on to concepts and methodologies that were 
introduced when the field first began to develop. Early objects of reliability 
analysis were simple systems, at least in a functional sense, and simple 
measures could be used to determine their reliability. In a period of 30 
years, however, system complexities have grown to that of a large 
multiprocessing computer or a complex operating system, while reli- 
ability measures have remained almost as simple as they were when 
applied to a radio receiver. Of course, the measures are now much 
more difficult to evaluate and consequently, in the area of computer 
reliability analysis, much of the recent research effort has focused on the 
derivation of formulae for calculating the values of traditional reliability 
measures such as "probability of success." This is not to deny the
importance of "probability of success" as a measure; indeed, when the
term "reliability" is narrowly interpreted it is usually given this
meaning. However, in the analysis of systems with complex behavior, what
constitutes "success" or "failure" can likewise be very complicated.
It is this fact that is often overlooked when complex systems are analyzed
using relatively simple reliability measures.

For example, a paper by Bouricius et al. [3] presents the
following formula for the reliability of a standby-sparing configuration
with N active units and S unpowered spares:

    $R = e^{-N\lambda t} \sum (\ldots\, k - 1 \,\ldots)\, C^{k} (1 - \ldots$

where each active unit is assumed to have an exponential failure rate
$\lambda$, and each spare has an exponential failure rate $\mu$. C is the
coverage.
Although the failure criterion for this model was not stated, our analysis 
indicates that they assumed that all N active units must be functioning, 
and if a single active unit fails after the replacements have been 
exhausted, the entire system fails. 

Mathur did a similar type of analysis [4] using less stringent
failure criteria. He defines a hybrid system to be one which behaves
as a simple NMR core after all of the spares have been depleted, so
that the system fails only when there remain fewer than (N + 1)/2
unfailed modules. Mathur's equation for the reliability of this
configuration is, for S > 1
Clearly, considerable effort has gone into reliability analyses
of this type. In fact, most of the combinatorial problems seem to have 
been solved, and in general, papers published in the last three years 
have essentially been restatements of previous results, with some 
minor modifications at best. Also, in the reliability analysis of both 
computer hardware and computer software, there has been a tendency 
in the past to blur the distinction between faults and errors, and to 
treat all faults identically, ignoring the fact that different classes of 
faults may have very different effects on system behavior. As a 
consequence, reliability computations that are based on such analysis 
methods may be quite misleading. Moreover, depending on the system 
being analyzed, the computations might be optimistic in one case and 
pessimistic in another. What is needed, then, is a more refined 
analysis model wherein the concepts of "fault" and "error" are 
distinguished, and wherein the internal "state" of a system can be 
accounted for when determining whether a fault causes an error. 

The problems reviewed here are not presented as particular
difficulties which this research effort intends to solve, but rather as
examples of the type of problem which results from the fact that
reliability analysis, as a field, has yet to agree on what concepts
are central to the evaluation of the reliability of complex systems and, of
course, how such concepts should be precisely formulated. It is our
view that a more comprehensive investigation of basic reliability
concepts is necessary before further meaningful work can be done at a
more detailed level.

3.1.2 Computers with Faults. To establish a more refined
analysis model of the type suggested above, let us begin by viewing a
digital computer as a rather general type of system which, at discrete
points in time, receives input data which, in turn, effects changes in
the system's internal state. It will be assumed that the state set is
"coordinatized," where a subset of the coordinates represents the values
of those state variables that are observable as output variables. The
transition structure of such a system may vary with time because faults
occur or because the system is reconfigured in an attempt to recover
from a fault. At a given instant of time the structure is fixed, however,
and is described by a transition function which determines the state of
the computing system at time i + 1, given the state at time i and the
input received at time i. Formalizing this notion we have:

Definition: A (formal) computer is a system

    $C = (X, Q, \Delta)$

where

    X is a nonempty set, the input set of C,

    Q is a nonempty set, the state set of C,

    $\Delta$ is a sequence of functions

        $\Delta = (\delta_0, \delta_1, \delta_2, \ldots)$

    where $\delta_i: Q \times X \to Q$, the transition function of C at time i
    (i = 0, 1, 2, ...).

Thus a computer, as defined above, is a discrete-time,
time-varying system whose structure at time i is described by transition
function $\delta_i$ (i = 0, 1, 2, ...). In particular, if $q \in Q$ is the state of C at
time i and $a \in X$ is the input received at time i then $\delta_i(q, a)$ is the
state of C at time i + 1. In case structure does not vary with time,
that is,

    $\delta_{i+1} = \delta_i, \quad i = 0, 1, 2, \ldots$                    (3.1)

then C is time-invariant. Thus if $C = (X, Q, \Delta)$ is time-invariant,
$\Delta$ is uniquely determined by $\delta_0$ and C can alternatively be regarded
as a (state) sequential machine with (fixed) transition function
$\delta = \delta_0$.

A computer is finite-input if $|X| < \infty$ and finite-state if
$|Q| < \infty$. Note that even in case a computer is both finite-input and
finite-state, it is not finitely specifiable unless its structure $\Delta$ is
finitely specifiable. However, in the subsequent application of this model
to reliability analysis, all computers (both fault-free and faulty) of
concern in the analysis will indeed be finitely specifiable.
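The definition can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the names `Computer`, `delta_at`, `step`, `time_invariant` and the one-bit `toggle` machine are hypothetical and do not appear in the report. The infinite sequence of transition functions is represented finitely by a function from the time index i to the transition function at time i.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable
Symbol = Hashable
Transition = Callable[[State, Symbol], State]


@dataclass(frozen=True)
class Computer:
    """Formal computer C = (X, Q, Delta).

    Delta is represented by delta_at, a function mapping each time i
    to the transition function delta_i (a finite specification of an
    infinite sequence).
    """
    inputs: FrozenSet[Symbol]
    states: FrozenSet[State]
    delta_at: Callable[[int], Transition]

    def step(self, i: int, q: State, a: Symbol) -> State:
        """State at time i + 1, given state q and input a at time i."""
        return self.delta_at(i)(q, a)


def time_invariant(inputs, states, delta: Transition) -> Computer:
    """Computer satisfying (3.1): delta_i = delta for every i."""
    return Computer(frozenset(inputs), frozenset(states), lambda i: delta)


# Hypothetical one-bit example: the next state is q XOR a.
toggle = time_invariant({0, 1}, {0, 1}, lambda q, a: q ^ a)
print(toggle.step(0, 0, 1))   # 1
```

Representing the structure as a function of time keeps a time-invariant computer and a faulty, time-varying one in the same form, which is what the later definitions rely on.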

The most general view of computer behavior is that of "string
manipulation." Beginning in some initial state $q_0$, determined by the
program to be executed and by stored data, C receives an input sequence
(string) of symbols $a_0 a_1 \ldots a_{n-1}$ where $a_i \in X$ (i = 0, 1, ..., n-1) is the
input received at time i. In response to this input sequence, the
computer will pass through a sequence (trajectory) of states

    $q_0 q_1 \ldots q_n$

where $q_i \in Q$ (i = 0, 1, ..., n) is the state of C at time i. Thus the
"state behavior" of C may be viewed as a function from the set $X^*$ of
all finite-length sequences of input symbols (including the null sequence
$\Lambda$) into the set $Q^+$ of all finite-length sequences of states. More
precisely, if $C = (X, Q, \Delta)$ and $q \in Q$, the state-behavior of C in q is a
function $\alpha_q: X^* \to Q^+$ defined inductively as follows:

    i)  $\alpha_q(\Lambda) = q$, for all $q \in Q$, and
    ii) $\alpha_q(xa) = \alpha_q(x)\,\delta_i(q', a)$                       (3.2)

where $i = \mathrm{lg}(x)$ (the length of x) and $q'$ is the final state of the
trajectory $\alpha_q(x)$, for all $q \in Q$, $x \in X^*$ and $a \in X$. It is easy to verify
that this formal notion of state-behavior captures the intuitive notion
discussed above. Note that $\alpha_q$ maps input sequences of length n into
state trajectories of length n + 1.
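The inductive definition (3.2) unrolls into a simple loop. The sketch below is illustrative only: the function name `state_behavior` and the time-varying example structure are assumptions made for this example, not part of the report.

```python
from typing import Callable, Hashable, Sequence, Tuple

State = Hashable
Symbol = Hashable
Transition = Callable[[State, Symbol], State]


def state_behavior(delta_at: Callable[[int], Transition],
                   q: State, x: Sequence[Symbol]) -> Tuple[State, ...]:
    """Return alpha_q(x), the state trajectory of length lg(x) + 1."""
    trajectory = [q]                      # clause i): alpha_q(Lambda) = q
    for i, a in enumerate(x):             # i = length of the prefix read so far
        q_last = trajectory[-1]           # final state of alpha_q(prefix)
        trajectory.append(delta_at(i)(q_last, a))   # clause ii)
    return tuple(trajectory)


# Hypothetical time-varying structure: XOR at times 0 and 1, then hold.
def delta_at(i: int) -> Transition:
    if i < 2:
        return lambda q, a: q ^ a
    return lambda q, a: q


print(state_behavior(delta_at, 0, (1, 1, 1)))   # (0, 1, 0, 0)
```

The null input sequence yields the one-element trajectory consisting of the initial state, in agreement with clause i).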

Having established the concepts of "computer" and "state-behavior,"
we adopt a concept of "computation" that is somewhat more general
than usually considered. Since computational errors may be due to
faulty initial states or erroneous input symbols as well as to faulty
computers, we regard computation as consisting of three things: an
initial state q, an input sequence x and a state sequence y. More
precisely, a computation (over X and Q) is a triple (q, x, y) where $q \in Q$,
$x \in X^*$ and $y \in Q^+$ such that lg(y) = lg(x) + 1. Accordingly, q, x, and y
are referred to as the initial state, input sequence and state trajectory
(respectively) of the computation. Relative to a particular computer
C, a computation of C is a computation of the form $(q, x, \alpha_q(x))$. The
fundamental question of deciding whether a computer is behaving
within specified tolerances will be based on the nature of such
computations. However, even more basic than the notion of a computational
error is the concept of a "fault," that is, a transient or permanent
change in structure that may, in turn, cause errors.

Assuming a familiarity with the concepts of a "representation
scheme" and a "system with faults" (see [5], [6]), the specification
class $\mathcal{S}$ and realization class $\mathcal{R}$ that we choose in this case is the
class of all computers (as defined above), that is, both $\mathcal{S}$ and $\mathcal{R}$ are
equal to the class

    $\mathcal{C} = \{C \mid C \text{ is a computer}\}$.

Moreover, we will restrict our attention to faults that occur during
the use of a computer (as opposed to faults that occur during the design
process) and so, in the representation scheme $(\mathcal{S}, \mathcal{R}, \rho)$,
$\rho$ is taken to be the identity function. Accordingly a computer with
faults is a system

    $(C, F, \varphi)$

where $C \in \mathcal{C}$, F is a set of potential faults of C and $\varphi: F \to \mathcal{C}$
where, if $f \in F$, $\varphi(f)$ is the computer that results from fault f.
$\varphi(f)$ will alternatively be denoted $C^f$. One should be careful not to
interpret a "fault" f as some single physical failure that occurs in
the underlying system. Instead it should be interpreted as an entire
sequence of physical failures that could occur during the utilization
of the system. That part of a fault which changes the structure of
a system at some particular instant of time, say time i, will be
referred to as a "fault at time i" and will be more precisely defined
in a moment.

Given the general concept of a computer with faults, let us now
introduce certain restrictions that bring the concept closer to reality
and result in a model that can be used for reliability analysis. If
$(C, F, \varphi)$ is a computer with faults it will be assumed that the
fault-free specification C is time-invariant (see condition (3.1)). This is
not unreasonable since many physical systems and, in particular,
most computing systems can be represented as time-invariant systems
as long as there are no structural changes due to physical failures.
Suppose now that a physical failure occurs where the failure may be
transient, permanent, or a combination of the two, that is, a permanent
physical failure that has a transient component while the permanent
change is taking place. Such physical failures can be represented by
(formal) faults as follows. If $C = (X, Q, \Delta)$ is a computer, a fault of C
(at time i) is a triple $(\tau, \pi, i)$ where

    $\tau: Q \times X \to Q$, the transient component,
    $\pi: Q \times X \to Q$, the permanent component,
    i is a nonnegative integer, the time of occurrence.

The interpretation of $(\tau, \pi, i)$ is a physical failure* that occurs between
time i and time i + 1. $\tau$ is the transition function that the failing system
exhibits while the failure is taking place and $\pi$ is the transition function
that the system exhibits after the failure has taken place. Thus, if
$f = (\tau, \pi, i)$ is a fault of computer $C = (X, Q, \Delta)$, the result of f is the
computer $C^f = (X, Q, \Delta^f)$ where, if $\Delta^f = (\delta^f_0, \delta^f_1, \delta^f_2, \ldots)$ then

    $\delta^f_j = \begin{cases}
        \delta_j & \text{if } 0 \le j < i \\
        \tau     & \text{if } j = i \\
        \pi      & \text{if } j > i.
    \end{cases}$                                                (3.3)


If, in the result of $f = (\tau, \pi, i)$, there is no permanent change in
structure, that is, $\pi = \delta_i$, then f is a transient fault (at time i). A fault
$(\tau, \pi, i)$ which represents no change whatsoever, that is, $\pi = \delta_i$
and $\tau = \delta_i$, is referred to as a null or improper fault (at time i).
Finally, as discussed earlier, we want the general concept of a "fault"
to include the representation of a succession of physical failures that
occur during the utilization of the system. Thus, in general, a
(multiple) fault of C is a sequence

    $f = (f_{i_1}, f_{i_2}, \ldots, f_{i_k})$

where $i_1 < i_2 < \cdots < i_k$ and $f_{i_j}$ is a fault of C at time $i_j$. The
corresponding result of f is an immediate generalization of definition
(3.3).
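One reading of definition (3.3) and its generalization to multiple faults can be sketched as follows. This is a sketch under assumptions, not the report's own construction: the helper name `result_of` is hypothetical, the fault-free computer is taken to be time-invariant, and the generalization is assumed to be that after a fault at one time and before the next, the structure is the permanent component of the most recent fault.

```python
def result_of(delta, faults):
    """Return delta_at for the faulty computer C^f.

    delta  : transition function of the time-invariant fault-free computer
    faults : iterable of (tau, pi, i) triples with distinct times i
    """
    ordered = sorted(faults, key=lambda fault: fault[2])

    def delta_at(j):
        current = delta                 # structure before any fault
        for tau, pi, i in ordered:
            if j == i:
                return tau              # failure in progress at time i
            if j > i:
                current = pi            # latest permanent component so far
        return current

    return delta_at


# Hypothetical one-bit machine with a transient stuck-at-one at time 1.
xor = lambda q, a: q ^ a
stuck_one = lambda q, a: 1
delta_f = result_of(xor, [(stuck_one, xor, 1)])
print([delta_f(j) is xor for j in range(3)])   # [True, False, True]
```

Because the fault in the example is transient (its permanent component equals the fault-free transition function), the structure returns to the fault-free function after time 1.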

To summarize, then, the computers with faults that we shall
consider are of the form $(C, F, \varphi)$ where C is a time-invariant computer,
F is a set of faults of C (including at least one null fault), and
$\varphi: F \to \mathcal{C}$ where $\varphi(f)$ is equal to the result of f (as defined above).

To illustrate the concepts developed above, let us consider a
trivial example. (It must be emphasized that this and future examples
are not intended to illustrate the full power of the formalism. They
are simply given as an aid to the intuitive understanding of the
definitions and results.)

Consider a TMR configuration (three modules whose outputs feed a
majority voter) where each module has a (fault-free) transition function
$\delta': Q' \times X \to Q'$ where $X = Q' = \{0, 1\}$. Then the fault-free TMR
configuration is represented by a computer

    $C = (\{0, 1\}, Q, \Delta)$

where

    $Q = \{(q_1, q_2, q_3, q_4) \mid q_i \in \{0, 1\}\}$

with $q_1$, $q_2$ and $q_3$ representing the states of modules 1, 2 and 3
(respectively) and $q_4$ representing the value of the voter output.

The transition structure

    $\Delta = (\delta_0, \delta_1, \delta_2, \ldots)$

is given by a fixed function $\delta = \delta_i$ for all i, where

    $\delta((q_1, q_2, q_3, q_4), a) = (q_1', q_2', q_3', \mu(q_1', q_2', q_3'))$

with $q_i' = \delta'(q_i, a)$ and $\mu$ equal to the majority function (realized by the
voter).

Suppose now that at time 2 there is a transient stuck-at-one failure
at the output of module 1 and at time 4 there is a permanent
stuck-at-zero failure at the output of module 3. Then this succession of
failures is represented by the (multiple) fault

    $f = (f_2, f_4)$

where $f_2$ is the fault at time 2 and $f_4$ is the fault at time 4. More
specifically, $f_2$ is the fault

    $(\tau_2, \pi_2, 2)$

where (letting $q_i' = \delta'(q_i, a)$):

    $\tau_2((q_1, q_2, q_3, q_4), a) = (1, q_2', q_3', \mu(1, q_2', q_3'))$

and

    $\pi_2 = \delta$;

$f_4$ is the fault

    $(\tau_4, \pi_4, 4)$

where:

    $\tau_4((q_1, q_2, q_3, q_4), a) = (q_1', q_2', 0, \mu(q_1', q_2', 0))$

and

    $\pi_4 = \tau_4$.

The result of the fault f is the computer $C^f = (\{0, 1\}, Q, \Delta^f)$
where

    $\delta^f_j = \begin{cases}
        \delta  & \text{if } 0 \le j < 2 \\
        \tau_2  & \text{if } j = 2 \\
        \pi_2   & \text{if } j = 3 \\
        \tau_4  & \text{if } j = 4 \\
        \pi_4   & \text{if } j > 4.
    \end{cases}$
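The TMR example can be written out directly. The sketch below is a hedged illustration: the Python names are hypothetical, and the module transition function is assumed to be XOR, which is the table the report adopts later in Section 3.1.3; at this point in the text the module function is still unspecified.

```python
def module(q, a):
    """Module transition function; XOR is an assumption (see Section 3.1.3)."""
    return q ^ a


def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0


def next_modules(state, a):
    q1, q2, q3, _ = state
    return module(q1, a), module(q2, a), module(q3, a)


def delta(state, a):                      # fault-free TMR transition
    n1, n2, n3 = next_modules(state, a)
    return (n1, n2, n3, majority(n1, n2, n3))


def tau2(state, a):                       # module 1 output stuck at one
    _, n2, n3 = next_modules(state, a)
    return (1, n2, n3, majority(1, n2, n3))


def tau4(state, a):                       # module 3 output stuck at zero
    n1, n2, _ = next_modules(state, a)
    return (n1, n2, 0, majority(n1, n2, 0))


pi2, pi4 = delta, tau4                    # f_2 is transient, f_4 is permanent


def delta_f(j):
    """Transition function of C^f at time j."""
    if j < 2:
        return delta
    return {2: tau2, 3: pi2, 4: tau4}.get(j, pi4)


print(delta_f(2)((1, 1, 1, 1), 1))        # (1, 0, 0, 0)
```

The printed transition is the step at time 2, where module 1 is forced to one while modules 2 and 3 (and hence the voter) compute the fault-free value.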
3.1.3 Tolerance Relations and Erroneous Computations. Let us
now consider the effects of faults on computer behavior, i.e., the
computational errors that may be caused by faults. Recall that, in general,
a computation (over X and Q) is a triple (q, x, y) where q is the initial
state, x is the input sequence and y is the state trajectory of the
computation. What we seek is a basis for comparing computations
to determine whether an actual computation (q, x, y) is within "tolerance"
of the desired computation (q', x', y'). More formally, if U is the
class of all computations (over X and Q), a tolerance relation (for
computations) is a relation T on the set U such that T is reflexive.
If $(u, u') \in T$ we will write $u\,T\,u'$ with the interpretation that, from the
user's point of view, actual computation u is within tolerance of
desired computation u'. The reflexive condition says simply that
every computation is within tolerance of itself, which is certainly a
reasonable requirement. Accordingly the strongest tolerance
relation is the relation of equality; the weakest is the relation $T = U \times U$
where every computation is within tolerance of every other computation.
The latter says that anything the computer does is acceptable and
therefore represents a theoretical extreme as opposed to a practical
one.

In specifying a tolerance relation, one is able to specify
tolerable changes in initial state or tolerable changes in input as well as
tolerable changes in state trajectory. However, if a system is assumed
to be free of initialization and input faults, it is convenient to consider
tolerance relations T that only permit changes in the state trajectory
of a computation. More precisely, if $u, u' \in U$ where u = (q, x, y) and
u' = (q', x', y') then T satisfies the condition:

    $u\,T\,u'$ implies q = q' and x = x'.                        (3.4)

If a tolerance relation T is so restricted, it follows that whenever u
is not in tolerance of u', the state trajectories of u and u' must differ.
During the current reporting period we have confined our attention to
tolerance relations of this type and it will be assumed that condition
(3.4) is satisfied, unless otherwise qualified.

Suppose now that $(C, F, \varphi)$ is a computer with faults and a specified
tolerance relation is being used to determine the computational integrity
of computers that result from faults. In particular, suppose $f \in F$
and u is a computation of the faulty computer $C^f$, that is, for some
$q \in Q$ and $x \in X^*$,

    $u = (q, x, \alpha^f_q(x))$

($\alpha^f_q$ is the state-behavior of $C^f$ in q; see (3.2).) Since the desired
computation is the computation performed by the fault-free computer
(for the same q and x), that is, the computation

    $u' = (q, x, \alpha_q(x))$

it is reasonable to regard u as "erroneous" if u is not within tolerance
of u'. More precisely, if T is a tolerance relation and $f \in F$ we have:

Definition: A computation u of $C^f$ is T-erroneous if
$u \mathrel{\not T} (q, x, \alpha_q(x))$ where q and x are the initial state and
input sequence of u. Otherwise u is T-error-free.

When the tolerance relation is understood, we will drop the reference
to T and refer to u as simply "erroneous" or, in the opposite case,
"error-free." We will also say that a T-erroneous computation of
$C^f$ is caused by f. Finally, if f can cause no T-erroneous computations
then f is T-tolerated.
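A tolerance relation can be treated as a predicate on pairs of computations, which makes the definition of a T-erroneous computation directly checkable for a given initial state and input sequence. The sketch below is illustrative only: the function names, the one-bit machine, and the relation `at_most_one_slip` are hypothetical and are not taken from the report.

```python
def state_behavior(delta_at, q, x):
    """State trajectory alpha_q(x) for a structure given by delta_at."""
    trajectory = [q]
    for i, a in enumerate(x):
        trajectory.append(delta_at(i)(trajectory[-1], a))
    return tuple(trajectory)


def equality(u, u_prime):
    """Strongest tolerance relation: the computations must coincide."""
    return u == u_prime


def at_most_one_slip(u, u_prime):
    """Hypothetical relation: same q and x (condition (3.4)), and the
    trajectories differ in at most one position. It is reflexive."""
    (q, x, y), (q2, x2, y2) = u, u_prime
    if (q, x) != (q2, x2) or len(y) != len(y2):
        return False
    return sum(s != s2 for s, s2 in zip(y, y2)) <= 1


def is_erroneous(T, good_at, faulty_at, q, x):
    """True if the computation of C^f on (q, x) is T-erroneous."""
    u = (q, x, state_behavior(faulty_at, q, x))
    u_prime = (q, x, state_behavior(good_at, q, x))
    return not T(u, u_prime)


# Hypothetical one-bit machine with a transient stuck-at-one at time 1.
xor = lambda q, a: q ^ a
good_at = lambda j: xor
faulty_at = lambda j: (lambda q, a: 1) if j == 1 else xor

print(is_erroneous(equality, good_at, faulty_at, 0, (1, 1)))          # True
print(is_erroneous(at_most_one_slip, good_at, faulty_at, 0, (1, 1)))  # False
```

The same faulty computation is erroneous under one relation and error-free under another, which is the point of making the tolerance relation an explicit parameter. Deciding whether a fault is T-tolerated would require checking every initial state and input sequence, which this sketch does not attempt.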

It should be noted that the concept of a T-erroneous computation
can capture the notion of an internal error as well as an input-output
error. To illustrate, consider the TMR example used earlier and
suppose T is the relation of equality (identity) on U, that is

    $u\,T\,u'$ if $u = u'$.

Then the fault $f = (f_2, f_4)$, considered in the earlier example, can
cause T-erroneous computations even though f cannot cause
input-output errors (assuming the modules are properly initialized). To
be more specific, let us suppose the module transition function $\delta'$
is given by:
    (q, a)     $\delta'(q, a)$
    (0, 0)     0
    (0, 1)     1
    (1, 0)     1
    (1, 1)     0


Then, for example, if q = (0, 0, 0, 0) and x = 101 then

    $\alpha^f_q(x) = (0, 0, 0, 0)(1, 1, 1, 1)(1, 1, 1, 1)(1, 0, 0, 0)$

since the transition function at time 2 is $\tau_2$. On the other hand

    $\alpha_q(x) = (0, 0, 0, 0)(1, 1, 1, 1)(1, 1, 1, 1)(0, 0, 0, 0)$.

Thus the computations $u = (q, x, \alpha^f_q(x))$ and $u' = (q, x, \alpha_q(x))$ are not
equal, that is, $u \mathrel{\not T} u'$ and hence u is a T-erroneous computation.

To continue the example, suppose T' is a second tolerance
relation which requires only that values on the output line (coordinate
4) be what they should be. More precisely, $(q, x, y)\,T'\,(q, x, y')$ if y and
y' have the same length, say n, and the i-th state of y has the same 4th
coordinate as the i-th state of y', i = 0, 1, ..., n-1. Given that T' is
the tolerance relation of interest, it can be shown that the fault
$f = (f_2, f_4)$ does not cause any T'-erroneous computations (provided
all module states are the same when the computation begins). In
other words, although f can cause internal errors (according to
tolerance relation T), it can cause no input-output errors (according
to tolerance relation T').
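The worked computation for q = (0, 0, 0, 0) and x = 101 can be reproduced mechanically. The following sketch uses hypothetical Python names and assumes the XOR module table given above; it checks only this single initial state and input sequence and does not attempt to verify the general claim about T'.

```python
def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0


def tmr(stuck=None):
    """TMR transition function; stuck = (module_index, value) forces the
    output of one module (index 0, 1 or 2), or None for fault-free."""
    def transition(state, a):
        nxt = [q ^ a for q in state[:3]]          # module function: XOR
        if stuck is not None:
            nxt[stuck[0]] = stuck[1]
        return (*nxt, majority(*nxt))
    return transition


delta, tau2, tau4 = tmr(), tmr((0, 1)), tmr((2, 0))


def delta_f(j):                                   # structure of C^f
    if j < 2:
        return delta
    return {2: tau2, 3: delta, 4: tau4}.get(j, tau4)


def state_behavior(delta_at, q, x):
    trajectory = [q]
    for i, a in enumerate(x):
        trajectory.append(delta_at(i)(trajectory[-1], a))
    return tuple(trajectory)


def T(u, u_prime):                                # equality
    return u == u_prime


def T_prime(u, u_prime):                          # output coordinate only
    (q, x, y), (q2, x2, y2) = u, u_prime
    return (q, x) == (q2, x2) and [s[3] for s in y] == [s[3] for s in y2]


q, x = (0, 0, 0, 0), (1, 0, 1)
u = (q, x, state_behavior(delta_f, q, x))
u_prime = (q, x, state_behavior(lambda j: delta, q, x))
print(u[2][-1], u_prime[2][-1])             # (1, 0, 0, 0) (0, 0, 0, 0)
print(T(u, u_prime), T_prime(u, u_prime))   # False True
```

For this input the faulty and fault-free trajectories differ only in the state of module 1 at time 3, so the computation is T-erroneous but T'-error-free.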

In general, any analysis of computing system reliability must
be based on some underlying criteria that define "acceptable" or
"tolerable" behavior. Since a tolerance relation is simply a formal
statement of such criteria, it too is fundamental to reliability
analysis. In many cases, the tolerance relation that underlies the
analysis is not stated explicitly but is nevertheless easily inferred.
In other cases, however, it is extremely difficult to judge what the
analyst regards as tolerable behavior and, consequently, the results
of the analysis are difficult to interpret.

When the underlying tolerance relation (or some equivalent
thereof) is not stated explicitly, results of the analysis can also
be misleading. For example, in a paper by Mathur ([7], 1971), the
problem of optimally allocating 7 identical modules is considered
with the conclusion that a standby replacement system (i.e., 1 active
unit and 6 unpowered spares) is "... clearly ... superior to hybrid
systems ..." (i.e., (3, 4) or (5, 2), with voter). What is ignored is
the fact that different tolerance relations are used to calculate the
respective reliabilities. With a single active unit an error yields an
incorrect output, while with the hybrid system an error results in an
incorrect internal state, but the output is still correct. The importance
of this distinction is, of course, application dependent, but certainly
the distinction must be kept in mind.
Based on a survey of reliability measures and analysis techniques
that are currently being used, it is our firm belief that the appropriate
starting point for reliability analysis is a precise definition of
what constitutes "acceptable" or "tolerable" behavior. Moreover, the
definition should be general enough to permit the specification of
relatively complex tolerance criteria involving multivariable
descriptions of performance, various levels of degraded performance, limits
on the duration of time and/or number of times that performance is
below a given level, and so forth. It is these considerations that
have motivated our development up to this point, culminating in the
concept of a "tolerance relation" which permits this kind of precise
specification. Accordingly, the subsequent investigation of reliability
analysis is based on the assumptions that the computer to be analyzed
is formally described as a computer with faults and tolerable behavior
is formally described by a tolerance relation defined on computations.

3.1.4 Reliability Measures. In analyzing the reliability of
a computing system, one must first specify just what is meant by the
term "reliability" since the word has taken on a variety of special
meanings. Generally, by reliability we will mean a sequence of one
or more numbers that reflect the ability to rely on a system. In
particular, if the system is a computer with faults, the numbers
reflect the ability to rely on the computations of the computer.
Accordingly, a reliability measure is a function from systems into
sequences of numbers whose value, for a particular system, is the
reliability of that system. Thus, in the context of this investigation,
a reliability measure is a function from the class of all computers
with faults into some Cartesian product of sets of numbers. In many
cases, the Cartesian product is one-dimensional, that is, the value of
the measure is a single number. If the product has more than one
dimension, the measure will be referred to as multi-dimensional.

When used in the above sense, reliability has a very general
meaning and includes such concepts as mean-time-to-failure,
availability, recoverability, effectiveness, etc. as well as the usual strict
definitions of the term that are based on some concept of successful
(adequate, acceptable, tolerable) operation. Thus the term "reliability,"
as used above, is synonymous with the term "utility" as used in the
statement of the augmented objectives (see Section 1). Unfortunately,
neither term is especially well-suited for the meaning given here since
reliability is often given a more restricted meaning and utility a more
general meaning.

To begin our investigation of reliability measures for computers
with faults, let us consider first how the usual, more restricted
meaning of "reliability" translates into the terms we have developed.
As defined, for example, by the Radio-Electronics-Television
Association in 1955, reliability is the "probability of a device performing
its purpose adequately for the period of time intended under the operating
conditions encountered." Translating into our terms, this
becomes the "probability that a computer with faults behaves within
tolerance for some specified utilization period." More precisely,
if $\mathbf{C} = (C, F, \varphi)$ is a computer with faults, T is a tolerance relation
on computations, and [0, t] = {0, 1, ..., t} is the utilization period
then

Definition: The (strict) reliability $R(\mathbf{C})$ of $\mathbf{C}$ is the probability that
the computation of $\mathbf{C}$ in the interval [0, t] is T-error-free.

The reliability measure, in this case, is the function R from
computers with faults into the real interval [0, 1], where $R(\mathbf{C})$ is the
reliability of $\mathbf{C}$.

Let us now examine what suffices to compute the values of this
measure. Conceptually, for each computer with faults $\mathcal{C} = (C, F, \varphi)$,
we must determine an underlying probability space that will
suffice to determine the reliability $R(\mathcal{C})$. More specifically, if
$\mathcal{P} = (S, \mathcal{E}, P)$, where S is the sample space, $\mathcal{E}$ is the event space (a
$\sigma$-algebra of subsets of S) and P is the probability measure, we must
determine choices of S, $\mathcal{E}$, and P that will determine $R(\mathcal{C})$. Beginning
with the sample space S, elements here must represent elementary
outcomes, namely, computations of $\mathcal{C}$. Since a computation of $\mathcal{C}$ is
actually a computation of $C^f$ for some $f \in F$, it suffices to let S be the
set of triples (q, x, f) in $Q \times X^t \times F$,
where $X^t = \{x \mid x \in X^* \text{ and } \lg(x) = t\}$. The interpretation of a sample
(q, x, f) is that the initial state is q, the input sequence is x and the
(multiple) fault that occurred is f. Thus, given a sample (q, x, f),
the corresponding computation (in the interval [0, t]) is the
computation

    $(q, x, \alpha^f(x))$.

The event space $\mathcal{E}$ must be chosen so that it contains the event

    $$E = \{(q, x, f) \mid (q, x, \alpha^f(x)) \text{ is } T\text{-error-free}\} \tag{3.6}$$

and permits the definition of a probability measure $P: \mathcal{E} \to [0, 1]$
(with the usual interpretation that, for all $D \in \mathcal{E}$, P(D) is the
probability that the outcome is in the event D). It follows then that
$\mathcal{P}$ suffices to determine $R(\mathcal{C})$ since, if E is the event given above,
then

    $$R(\mathcal{C}) = P(E) \tag{3.7}$$

(Technically speaking, the above equation should be regarded as the
definition of $R(\mathcal{C})$, for it is here that we give precise meaning to
the word "probability.")

Upon closer examination of this reliability measure, the reason 
for the careful development of the preceding sections should now be 
clear. We note first that the underlying probability measure P is
defined on initial states, input sequences and faults, and not simply
faults alone. This permits initial-state-dependent or data-dependent
performance to be accounted for in the reliability analysis. Most 
conventional approaches, on the other hand, account only for the 
probabilistic nature of faults, thereby ignoring effects of initialization 
and data. 

Second, we note that what constitutes "success" (as the term 
is usually employed in strict definitions of reliability) is precisely 
specified by a tolerance relation T. Moreover, the tolerance relation 
is not restricted to instantaneous "snapshots" of structure or behavior
but, instead, is defined on complete computations. This permits the 
past history of the computation and, in particular, the present state 
of the computer to be accounted for in the judgment as to whether 
performance is successful. 

To illustrate these remarks, let us suppose the "computer" to
be analyzed is a simple two-state device, namely, a trigger flip-flop
(alternately referred to as a T-flip-flop or mod-2 counter). The
fault-free representation of this device is the computer
$C = (\{0, 1\}, \{0, 1\}, \delta)$ where $\delta$ is given by the table:

    (q, a)    $\delta(q, a)$
    (0, 0)       0
    (0, 1)       1
    (1, 0)       1
    (1, 1)       0

that is, C stays in the same state if the input is 0 and changes
state if the input is 1. Suppose further that the only device failures
having an appreciable probability of occurrence are stuck-at failures
at the output where the hazard rate (failure rate) is a constant $\lambda$. Then,
by the usual method of reliability analysis (see [8], for example),
the reliability $\tilde{R}(C)$ is given by the equation:

    $$\tilde{R}(C) = e^{-\lambda t} \tag{3.8}$$

where [0, t] is the utilization period. (The notation $\tilde{R}$ is used to
distinguish this measure from the measure R defined above.) What
is being assumed here is that the flip-flop fails as soon as a stuck-at
failure occurs. This certainly simplifies the measure but, at the
same time, ignores the effects of initial state, input, the transition
function and, most importantly, what the user regards as tolerable
behavior.

Let us now examine how our more refined measure R can account
for effects ignored by the conventional measure $\tilde{R}$ of (3.8). In
particular, let us suppose that the
user is interested in the value of the output only when the computation
terminates. In other words, the tolerance relation is the relation T
where (q, x, y) T (q, x, y') if the final state of trajectory y is equal to
the final state of trajectory y'. If, further, we let F be the set of all
faults of C that represent (permanent) stuck-at failures and $f \in F$, it
follows that the computation $(q, x, \alpha^f(x))$ is T-error-free if and only if
the final state of $\alpha^f(x)$ is equal to the final state of $\alpha(x)$. Accordingly
the event E, corresponding to all error-free computations (see
(3.6)), is the event

    $$E = \{(q, x, f) \mid \text{final state of } \alpha^f(x) = \text{final state of } \alpha(x)\}.$$

It remains then to determine the probability of E which, by
(3.7), gives the reliability R(C). This is easily done if we consider
the following events (subsets of $\{0, 1\} \times \{0, 1\}^t \times F$):

    $N = \{(q, x, f) \mid f \text{ is null or } f = (\tau, \pi, i) \text{ where } i \ge t\}$
    $A_0 = \{(q, x, f) \mid q = 0\}$
    $A_1 = \{(q, x, f) \mid q = 1\}$
    $B_0 = \{(q, x, f) \mid \text{Parity}(x) = 0\}$
    $B_1 = \{(q, x, f) \mid \text{Parity}(x) = 1\}$
    $C_0 = \{(q, x, f) \mid f = (\tau, \pi, i) \text{ where } i < t \text{ and } \tau = \pi = \sigma_0\}$
    $C_1 = \{(q, x, f) \mid f = (\tau, \pi, i) \text{ where } i < t \text{ and } \tau = \pi = \sigma_1\}$

($\sigma_0$ and $\sigma_1$ denote the functions "constant 0" and "constant 1.")

These events can be paraphrased as follows:

    N:     No fault occurs in the interval [0, t).
    $A_i$: The initial state is i.
    $B_i$: The number of 1's in the input sequence is equal to i (mod 2).
    $C_i$: A stuck-at-i fault occurs in the interval [0, t).

We note first that $N \subseteq E$ since, here, no fault occurs before the end of
the utilization period. To determine other events in E, consider, for
example, the event $D = A_0 B_1 C_1$ (the intersection of these three events).
If $(q, x, f) \in D$, since f is a stuck-at-1 fault that occurs during the
utilization period, the final state of $\alpha^f(x)$ is equal to 1. But q = 0
and the parity of x is odd, so the final state of $\alpha(x)$ is also equal to
1. In other words, such computations are T-error-free and we
conclude that $D \subseteq E$. By similar reasoning for other compound events,
it can be established that:

    $$E = N \cup A_0 B_0 C_0 \cup A_1 B_1 C_0 \cup A_1 B_0 C_1 \cup A_0 B_1 C_1$$

Since the events on the right-hand side are mutually exclusive and
assuming the events $A_i$, $B_j$ and $C_k$ are independent (a reasonable
assumption in this case) we have:

    $$P(E) = P(N) + [P(A_0)P(B_0) + P(A_1)P(B_1)]\,P(C_0) + [P(A_1)P(B_0) + P(A_0)P(B_1)]\,P(C_1).$$

Here, under the earlier assumption regarding failure rates,

    $$P(N) = e^{-\lambda t}$$

and, since $C_0 \cup C_1 = \bar{N}$,

    $$P(C_0) + P(C_1) = 1 - e^{-\lambda t}.$$

If we assume further that all sequences in $\{0, 1\}^t$ are equally likely
and the initial state is always 0 then

    $$P(A_0) = 1, \quad P(B_0) = 0.5, \quad P(A_1) = 0, \quad P(B_1) = 0.5.$$

Substituting these values in the above expression, we obtain the
reliability of C, that is,

    $$R(C) = P(E) = e^{-\lambda t} + 0.5\,(1 - e^{-\lambda t}). \tag{3.9}$$
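The factor 0.5 in (3.9) can be cross-checked by brute force. The following Python sketch enumerates input sequences and fault times for the trigger flip-flop and counts the computations whose final state matches the fault-free one. The function names and the discrete-step fault model (the state is held at the stuck value from the step at which the fault occurs) are assumptions made for illustration, not part of the formalism above.

```python
from itertools import product

def final_state(q, x):
    """Fault-free final state of the trigger flip-flop: each input 1 toggles q."""
    for a in x:
        q ^= a
    return q

def final_state_stuck(q, x, stuck_value, fault_step):
    """Final state when a stuck-at fault occurs at fault_step < len(x).

    Assumption: from fault_step on, the state is held at stuck_value."""
    for step, a in enumerate(x):
        if step >= fault_step:
            return stuck_value
        q ^= a
    return q

def error_free_fraction(t, q, stuck_value):
    """Fraction of (x, fault_step) pairs, with x ranging over {0,1}^t and
    fault_step over 0..t-1, whose final state equals the fault-free one."""
    good = total = 0
    for x in product((0, 1), repeat=t):
        for fault_step in range(t):
            total += 1
            if final_state_stuck(q, x, stuck_value, fault_step) == final_state(q, x):
                good += 1
    return good / total

# With q = 0, a stuck-at-0 fault is tolerated exactly when the parity of x is
# even (A0 B0 C0) and a stuck-at-1 fault exactly when it is odd (A0 B1 C1).
for stuck_value in (0, 1):
    print(stuck_value, error_free_fraction(4, 0, stuck_value))
```

With equally likely input sequences, half of the computations that suffer a stuck-at fault are still T-error-free, whichever value the output sticks at.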

Comparing (3.9) with (3.8), we see that the reliability of the
flip-flop is greater when error-free behavior is judged according to
the tolerance relation T. Thus, for example, if the failure rate $\lambda$ is
$10^{-4}$ failures/hour and the utilization period is 1000 hours then,
according to (3.8):

    $$\tilde{R}(C) = e^{-0.1} \approx 0.90.$$

On the other hand, applying (3.9):

    $$R(C) = e^{-0.1} + 0.5\,(1 - e^{-0.1}) \approx 0.95.$$

If some other set of assumptions were made regarding the probabilistic
nature of C (i.e., the probability measure P), the extent of the
improvement might differ but, in no case, would R(C) be less than
$\tilde{R}(C)$, since $N \subseteq E$.
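The comparison can be reproduced numerically. The sketch below evaluates (3.8) and the general expression for P(E) in Python. The parameter `p_c0_given_fault`, which splits the fault probability between stuck-at-0 and stuck-at-1, is a hypothetical name introduced here; the text only fixes the sum P(C_0) + P(C_1).

```python
from math import exp

def conventional_reliability(lam, t):
    """Equation (3.8): the device counts as failed once any stuck-at failure occurs."""
    return exp(-lam * t)

def tolerance_reliability(lam, t, p_a0=1.0, p_b0=0.5, p_c0_given_fault=0.5):
    """P(E) from the event decomposition, assuming the A, B and C events are
    independent. p_c0_given_fault (hypothetical) is the share of faults that
    are stuck-at-0."""
    p_n = exp(-lam * t)
    p_c0 = p_c0_given_fault * (1 - p_n)
    p_c1 = (1 - p_n) - p_c0
    p_a1, p_b1 = 1 - p_a0, 1 - p_b0
    return (p_n
            + (p_a0 * p_b0 + p_a1 * p_b1) * p_c0
            + (p_a1 * p_b0 + p_a0 * p_b1) * p_c1)

lam, t = 1e-4, 1000
print(round(conventional_reliability(lam, t), 2))  # 0.9
print(round(tolerance_reliability(lam, t), 2))     # 0.95
```

Because the bracketed terms are probabilities, the second function can never return less than P(N), whatever values are chosen for the three parameters.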

3.1.5 Topics for Further Investigation. The ability to define 
a reliability measure and apply it in the manner illustrated above 
demonstrates both the feasibility and the potential of this approach 
to reliability analysis. Due to the generality of the framework, there
are relatively few limitations on the types of systems, faults, tolerance 
relations, and reliability measures that are describable within the 
formalism. However, to say that something is describable (in the
sense that a description exists) is usually not enough; it must also
be the case that an object remains describable when the description 
process is subject to constraints on money, manpower, time and space. 
If computers are used in the description process (which will be 
mandatory in many cases), time includes computer run-time and 
space includes computer storage. Therefore, in parallel with our 
further investigation of appropriate tolerance relations and reliability 
measures we intend to investigate various means of simplifying their 
description. If the price paid for simplification is a less accurate 
description, the effects of such inaccuracies will also be investigated. 

These remarks regarding the economics of the description 
process apply as well to the process of evaluating the reliability of 
a computer according to some reliability measure, given that the 
computer (with faults) and the measure have already been described. 

One possible means of simplifying a complex evaluation process would 
be to decompose it via a decomposition of the measure. In other words, 
attempt to define submeasures which can be more easily evaluated 
and, in turn, combined to yield the value of the measure. We believe 
this approach deserves investigation and we intend to actively pursue 
it during the next reporting period. In cases where the calculation
of exact values is infeasible, due to the computational complexity of the 
algorithm or to insufficient knowledge of the underlying probability 
measure, methods of calculating approximate values will be investigated.
If the measure is decomposed, as suggested above, such methods
could also be used to calculate the values of submeasures. 

Aside from these questions regarding how measures are des- 
cribed and how they are evaluated, a more fundamental question is 
the determination of just what should be measured. The strict 
reliability measure discussed in the previous section (or, more 
properly, the class of strict reliability measures, since the tolerance 
relation T can be varied) is but one measure among many that 
could be applied. In viewing what others have accomplished
in this regard, we believe that "recoverability," or what
is usually referred to as "coverage" [9], deserves much more study
with regard to its measure. This includes all the aspects we have 
discussed for reliability measures in general, that is, a precise 
definition of what is meant by recoverability (as defined on computers 
with faults), economic descriptions of whatever objects the recover- 
ability measure is based on, and an efficient means of evaluating the 
recoverability of a given computer. 

In support of the need for such an investigation, Bouricius
et al. [9] state that "... coverage ... is the single most important
parameter in high-reliability system design. Changing the coverage from 1
to about 0.98 can result in orders of magnitude degradation in system
mission time." Also, we have programmed several reliability functions
involving a coverage parameter, and have observed, through interactive
terminal sessions, that coverage is indeed a critical parameter.
Moreover, the relative importance of coverage increases with
increasing system reliability.
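The sensitivity to coverage can be illustrated with a small Python sketch. It is not one of the reliability functions programmed for this project; it assumes a textbook standby-sparing model (one active unit with constant failure rate, spares that do not fail while dormant, and each failure recovered from with probability `coverage`), and the function name is hypothetical.

```python
from math import exp, factorial

def standby_reliability(lam, t, spares, coverage):
    """Assumed model: one active unit with failure rate lam plus `spares`
    standby units that do not fail while dormant. Each failure of the active
    unit is recovered from with probability `coverage`."""
    x = coverage * lam * t
    return exp(-lam * t) * sum(x**k / factorial(k) for k in range(spares + 1))

# lam * t = 0.1 with three spares, perfect versus 0.98 coverage.
for c in (1.0, 0.98):
    unreliability = 1.0 - standby_reliability(1.0, 0.1, spares=3, coverage=c)
    print(f"coverage={c}: unreliability={unreliability:.2e}")
```

Under these assumptions the unreliability rises from a few parts per million to roughly two parts per thousand when coverage drops from 1 to 0.98, which is the kind of effect the quotation describes.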

As a consequence of this assessment, just prior to the end of 
the present reporting period we initiated an investigation of recoverability 
measures and their evaluation. The initial step has been to explore 
just what is meant by "recovery" and "time of recovery" in the
context of a computer with faults and relative to some specified toler- 
ance relation. Once these questions are settled, precise formulation 
of a recovery measure will begin and the investigation will proceed 
as outlined above. 

3.2 On-line Diagnosis

3.2.1 Background. In many applications, especially those in
which a computer is being used to control some process in real
time (e.g., telephone switching, flight control of an aircraft or
spacecraft, etc.), it is desirable to constantly monitor the performance
of the system, as it is being used, to determine whether the actual
system is within tolerance of the intended system. Informally, by
"on-line diagnosis" we mean a monitoring process of this type where
the extent of the diagnosis depends on the meaning of "within tolerance."
Thus, for example, if being within tolerance means having the same
input-output behavior, then on-line diagnosis becomes on-line
"detection." In the special case where the implementation of on-line
diagnosis is completely internal to the system being diagnosed, it
is referred to as "self diagnosis" or "self checking."

The incorporation of special hardware for the purpose of on-line
diagnosis dates back to the relay computers developed by Bell
Laboratories in the early-to-mid 1940's, where biquinary codes
were used to dynamically check the operation of the computer [10].
A more general look at codes for checking logical operations was
first taken by Peterson and Rabin in 1959 [11], where they showed
that combinational circuits can vary greatly in their inherent on-line
diagnosability. The use of coding techniques in the design of
self-checking circuits was further explored by Carter and Schneider in
1968 [12] and by Anderson in 1971 [13]. In addition, a number of
special on-line diagnosis methods have been considered which apply
to specific hardware subsystems such as adders, counters, etc.
(see [14], for example).

A theoretical study of on-line fault diagnosis was initiated under 
NASA Grant NGR23-005-463. The motivation for this study was the
increasing use of computers in real-time applications where (i)
erroneous operation can result in the loss of human life and/or large sums
of money and (ii) interruptions in the operation, for the purpose of
off-line diagnosis, are intolerable. In particular, our discussions
with NASA-Langley regarding such applications were influential in
precipitating this study.

The initial problem considered in our study was the formulation
of an appropriate class of system models (i.e., a class of "systems
with faults") that could serve as a basis for the theoretical study of
on-line diagnosis. This effort was motivated by the observations
that (i) conventional models of time-invariant systems (e.g., sequential
machines) are inadequate since they cannot represent the dynamics
of a system as faults occur and (ii) many systems are originally
designed with an explicit reset mechanism (e.g., the clear button on
an adding machine) or must have a reset capability due to their
intended implementation. These observations led to the formulation
of a class of resettable discrete-time systems which adequately
represent the structure and behavior of both "fault-free" and "faulty"
systems in an on-line diagnosis environment. Given a (resettable,
discrete-time) system S, a fault f of S is represented by a triple
$f = (S', \tau, \theta)$ with the interpretation that S is transformed into system
S' at time $\tau$ with transient state behavior $\theta$. The result of f is taken
to be the system $S^f$ which looks like S up to time $\tau$ and like S'
thereafter.
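The construction of the faulty system from the triple can be sketched in Python as follows. This is a simplified reading, not the formal definition: reset inputs and outputs are omitted, systems are reduced to their transition functions, and the transient map is assumed to carry the state of S at the fault time into a state of S'. All names are hypothetical.

```python
def run_with_fault(delta, delta_prime, tau, theta, q0, inputs):
    """State trajectory of the faulty system: it behaves like S (delta)
    before time tau and like S' (delta_prime) from tau on.

    Assumption: theta maps the state of S at time tau to a state of S'."""
    q = q0
    trajectory = [q]
    for step, a in enumerate(inputs):
        if step == tau:
            q = theta(q)                      # transient effect of the fault
        step_fn = delta if step < tau else delta_prime
        q = step_fn(q, a)
        trajectory.append(q)
    return trajectory

def toggle(q, a):
    """Fault-free trigger flip-flop."""
    return q ^ a

def stuck_at_1(q, a):
    """Faulty system S': the state is held at 1."""
    return 1

print(run_with_fault(toggle, stuck_at_1, tau=2, theta=lambda q: 1,
                     q0=0, inputs=[1, 0, 1, 1]))   # [0, 1, 1, 1, 1]
```

A fault time beyond the end of the input sequence leaves the trajectory identical to that of the fault-free system.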

Once such systems were defined, the next problem considered 
was the formulation of notions of fault tolerance, error, diagnosability,
realization, etc. that have a meaningful interpretation in the context
of on-line diagnosis. To summarize briefly, if S is a system and f
is a fault of S, we say that f is tolerated if the resulting faulty system
$S^f$ is able to mimic some desired behavior as specified by a reduced
system. Otherwise, f causes errors (i.e., erroneous outputs)
for some initial conditions and input sequences. Our notion of on-line
diagnosis involves an external detector D (assumed to be fault-free)
and a maximum delay k within which any error caused by a fault must
be detected. More specifically, a system S with faults F is
(D, k)-diagnosable if, for all f in F,

    (i) D responds negatively if S is fault-free,

and

    (ii) D responds positively within k time steps of the first
    occurrence of an error caused by f.
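Conditions (i) and (ii) can be checked on a single recorded run with a sketch like the one below. The function name and the boolean-trace representation are assumptions, as is the reading of (ii) in which an alarm raised at any step up to k steps after the first error counts. Diagnosability itself is a property over all faults in F, all initial conditions and all input sequences, so a check of this kind can only refute it, not establish it.

```python
def run_satisfies_dk(error_flags, alarm_flags, k, fault_free=False):
    """Check conditions (i) and (ii) on one finite run.

    error_flags[n] is True if the output at step n is erroneous;
    alarm_flags[n] is True if the detector D responds positively at step n."""
    if fault_free:
        return not any(alarm_flags)              # condition (i)
    if not any(error_flags):
        return True                              # no error on this run
    first_error = error_flags.index(True)
    # Assumed reading of (ii): any alarm up to step first_error + k counts.
    return any(alarm_flags[:first_error + k + 1])

errors = [False, False, True, False]
alarms = [False, False, False, True]
print(run_satisfies_dk(errors, alarms, k=1))   # True
print(run_satisfies_dk(errors, alarms, k=0))   # False
```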

After the above concepts were made precise, certain fundamental 
questions were posed and their investigation was initiated. The 
research outlined above was first described in the technical report
"On-line Diagnosis of Sequential Systems" [15].

3.2.2 Recent Activity. During this reporting period we have
continued our investigation of on-line diagnosis and we have obtained
results which have substantially increased our knowledge of the subject.
The activity during this period has focused on the diagnosis of two
sets of faults; namely, the set of "unrestricted faults" and the set
of "unrestricted component faults."

The set U of unrestricted faults of a system is defined to be 
simply the set of all faults of that system. Aside from representing
a "worst-case" fault environment, there are certain practical reasons
for considering U, at least at the outset. In particular, as the scale 
of integrated circuit technology becomes larger, it becomes more 
difficult to postulate a suitably restricted class of faults such as the 
class of all "stuck-at" faults. Moreover, although other failure 
models such as bridging failures have been proposed and studied 
(see [16] and [17] for example), little is known about the diagnosis 
of such failures. In addition, intermittent and multiple failures are 
also possible and are even more difficult to model. Finally, for a
given failure it may be impossible to determine the $\theta$ function of the
fault caused by this failure. Thus fault sets which do not restrict the
transient state behavior $\theta$ are advantageous.

Given the background of techniques that have been proposed and, 
in many cases, used to improve the on-line diagnosability of a system, 
the following question arises quite naturally. With regard to any 
technique that might be employed, how complex must the diagnosing 
system be as compared to the system being diagnosed, if the latter 
is to be on-line diagnosable for some prescribed set of faults? To 
answer this question, one must, of course, designate the complexity 
measure. As a measure of system complexity, we have chosen the 
number of reachable internal states. This measure reflects the memory 
capacity of a system and, without further restrictions on system
structure, it is the only measure of structural complexity that has a
reasonable interpretation. Here we have shown that if a system is on-line
diagnosable for the unrestricted set of faults then the detector is at 
least as complex as the specification. Moreover, this result holds 
even when the allowed time delay for error detection is arbitrarily 
large. 

One means of diagnosing the unrestricted faults of a system is 
to use a detector which consists of a duplicate of the system being 
diagnosed and a matching circuit which can dynamically compare the 
operation of the system with its duplicate. For systems which have 
(delayed) inverses, that is, systems which are information lossless, 
an alternative means of performing unrestricted fault diagnosis is 
the use of a loop check. Our research here has established that an 
inverse system can always be used for on-line unrestricted diagnosis 
if it too is information lossless. Although the lossless condition is 
sufficient, it is shown further that there exist systems for which a 
lossy inverse can also be used for on-line unrestricted fault diagnosis. 
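The duplication scheme described above can be sketched in a few lines of Python. The sketch assumes the detector can observe the state of the monitored system directly (for the trigger flip-flop of Section 3.1 the state is the output); the function names are hypothetical.

```python
def duplicate_and_compare(delta, q0, inputs, observed_states):
    """Detector built from a fault-free duplicate and a comparator.

    observed_states[n] is the state of the monitored system after input n.
    Returns one alarm flag per step: True where the two states differ."""
    q = q0
    alarms = []
    for a, observed in zip(inputs, observed_states):
        q = delta(q, a)                       # step the duplicate
        alarms.append(q != observed)          # matching circuit
    return alarms

def toggle(q, a):
    """Fault-free trigger flip-flop."""
    return q ^ a

# Monitored flip-flop whose state sticks at 1 from the third step on.
print(duplicate_and_compare(toggle, 0, [1, 0, 1, 1], [1, 1, 1, 1]))
# [False, False, True, False]
```

The duplicate carries as many states as the system it monitors, which is in line with the lower bound on detector complexity stated earlier in this section.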

Since not every system has an inverse, let alone one which can 
be used for unrestricted fault diagnosis, it is not always possible to 
apply this technique directly. However, we have shown that every 
system has a realization to which this scheme can be successfully 
applied. 

A detailed discussion of the above results has been documented 
in a paper entitled "On-line Diagnosis of Unrestricted Faults" [18] 
which has been submitted for publication to the IEEE Transactions on
Computers.

The on-line diagnosis of systems which are structurally
decomposed and are represented as a network of smaller systems, 
has also been investigated. The fault set considered here is the set 
of unrestricted component faults; namely, the set of faults which only 
affect one component system of the network. A characterization of 
networks which can be diagnosed using a combinational detector has 
been obtained and it is shown that any network can be made diagnosable 
in the above sense by the addition of one component with a complexity 
as great as the most complex component in the network. In addition, 
a lower bound has been obtained on the complexity of any component, 
the addition of which is sufficient to make a particular network 
combinationally diagnosable. 

A detailed discussion of all of the work to date on on-line 
diagnosis has recently been documented in the technical report
"On-line Diagnosis of Sequential Systems: II" [19]. This report includes
modifications of material covered in an earlier report ("On-line 
Diagnosis of Sequential Systems" [15]), and rigorously establishes 
the results reviewed above. 

3.2.3 Topics for Further Investigation. Although much progress
has been made towards achieving a thorough understanding of on-line
diagnosis, many possibilities for further investigation remain. Except
for research on the diagnosis of networks of systems, our investigation
has been dealing with totally unstructured systems. Such an
approach is well suited to the development of formal concepts
involved in the theory of on-line diagnosis. It is also well suited to 
the investigation of these concepts, provided that the faults in 
question do not depend on a more refined knowledge of structure as, 
for example, when the faults are unrestricted. On the other hand, 
many interesting and important questions are better studied in a more 
structured environment. One reason for this is that, with a structured 
system, we can consider the causes of faults. For example, given 
an unstructured system it makes no sense to speak of the set of 
faults caused by component failures of a certain type or by bridging 
failures. However, given a structured representation of a system 
(e. g. , a circuit diagram) we can discuss these and other types of 
failures (causes) and determine the resulting faults (effects). 

There are many different structural levels that could prove 
useful to a further investigation into the theory of on-line diagnosis. 

Two levels which we believe will be important are: the binary
state-assigned level and the logical circuit level. These levels and the
basis for their potential usefulness are explained below.

A machine M is said to be binary state-assigned if $Q = \{0, 1\}^n$
for some positive integer n. Given such a machine we can speak of
stuck-at-0 and stuck-at-1 and other types of memory failure. The
faults corresponding to these failures can be enumerated and
comparisons can be made between various schemes for diagnosing these faults.
Memory faults have been studied before in other contexts (see [20]
and [21] for example) and they are an important class of faults for
a number of reasons. As we have seen, only a limited amount of
structure is needed to discuss them. Thus memory faults can be
analyzed before the circuit design of the machine is complete. Also,
it is memory which distinguishes truly sequential systems from
purely combinational (one-state) systems. Combinational systems
are inherently easier than sequential systems to analyze and a number
of techniques for the on-line diagnosis of such systems are known
(see [14] and [22] for example).
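For a binary state-assigned machine, the enumeration of memory faults mentioned above is straightforward. The Python sketch below lists the single stuck-at faults of an n-bit state and the states each fault leaves unchanged; the representation of a fault as a (bit index, stuck value) pair and the function names are assumptions made for illustration.

```python
from itertools import product

def single_stuck_at_faults(n):
    """All single stuck-at memory faults of an n-bit state: (bit index, stuck value)."""
    return [(bit, value) for bit in range(n) for value in (0, 1)]

def apply_stuck_at(state, fault):
    """State actually held by the memory when one bit is stuck."""
    bit, value = fault
    return state[:bit] + (value,) + state[bit + 1:]

def masked_states(n, fault):
    """States of {0,1}^n that the fault leaves unchanged."""
    return [s for s in product((0, 1), repeat=n) if apply_stuck_at(s, fault) == s]

print(len(single_stuck_at_faults(3)))        # 6
print(len(masked_states(3, (0, 1))))         # 4
```

There are 2n single stuck-at faults, and each one is invisible in exactly half of the states, which is one reason the choice of state assignment matters for diagnosis.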

A system possesses structure at the logical circuit level if a 
representation of the system is given in terms of a logical circuit 
composed of primitive logical elements. These may be of the 
AND-OR variety, threshold elements, or any similar elements of a
"building block" nature depending upon the technology being considered.
This level is useful for investigating failures in the primitive components. 

Further work could also be performed at the network level of 
structural detail. At this level one could study the problem of imple- 
menting on-line diagnosis on a whole computer whereas with the other 
levels the emphasis would be on diagnosing one module. Note that in 
our definition of diagnosis the detector is not constrained to give simply 
a yes-no response. It could also provide extra information for use
in automatic fault location. Thus at this level the problem of which
subsystems must be explicitly observed by the detector to achieve
some desired fault location property could be studied. 

One problem that requires extension of our present model 
(at any structural level) is the problem of automatic reconfiguration 
of the system under the control of the detector. To study this 
problem, the model used would have to allow for feedback from the 
detector to the system it is observing. The question of how such an 
extension should be made is an interesting one and, if answered
satisfactorily, the resulting model could serve as a basis for a
systematic investigation of reconfiguration techniques.